Fix TROP bootstrap SE backend divergence under fixed seed#354
Merged
Conversation
Rust and Python TROP backends produced different bootstrap standard errors for the same `seed` value. On a tiny correlated panel under `seed=42` the gap was ~28% of SE: Rust seeded `rand_xoshiro:: Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), so identical seeds mapped to different bytestreams. Canonicalize on numpy. New `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` pre-generates per-replicate (control, treated) positional index arrays from a numpy `Generator` and hands them to both backends through the PyO3 surface — both Rust bootstrap functions (`bootstrap_trop_variance_global`, `bootstrap_trop_variance`) now accept `control_indices` and `treated_indices` as `i64` arrays in place of `seed: u64`. Parallelism is preserved. Sampling law (stratified: controls then treated, with replacement) is unchanged. Global-method SE is now backend-invariant under the same seed to machine precision: the prior `xfail(strict=True)` in `test_bootstrap_seed_reproducibility` is flipped to a passing `assert_allclose(atol=rtol=1e-14)` and parametrized over `[0, 42, 12345]`. A companion `test_bootstrap_seed_reproducibility_local` is added for the local-method bootstrap. It is currently `xfail(strict=True)` because aligning the RNG exposed two separate local-method backend divergences beyond this PR's scope: Rust's `compute_weight_matrix` normalizes time and unit weights to sum to 1, while Python's `_compute_observation_weights` does not; and the Python fallback's `_compute_observation_weights(_precomputed branch)` reads the original-panel cached `Y`/`D` instead of the bootstrap-sample arguments. Both are tracked as follow-up rows in `TODO.md` with file:line pointers and will land in a separate methodology PR. Closes the bootstrap half of silent-failures audit finding #23 (the grid-search half closed in PR #348). Reference: Athey, Imbens, Qu & Viviano (2025), "Triply Robust Panel Estimators", Algorithm 3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b10e46c to
225674c
Compare
Owner
Author
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Highest unmitigated finding: Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
P1 — Rust `bootstrap_trop_variance` and `bootstrap_trop_variance_global` now validate every element of `control_indices` and `treated_indices` before entering the parallel loop. Negative values and values `>= pool size` raise `PyValueError` with the offending index, rather than panicking on a negative-to-`usize` cast or indexing `original_*_units` out of bounds inside the rayon closure. P3 — `rust/src/trop.rs` function header docs rewritten for both bootstrap entry points. Removed the stale `seed` argument; added description of `control_indices` and `treated_indices` shape/dtype/ range, and documented the Python-canonical RNG contract via `diff_diff.bootstrap_utils.stratified_bootstrap_indices`. P3 — `diff_diff/trop_local.py` comment on the pre-generate block rewritten to state that local-method SE is NOT yet bit-identical across backends (only the RNG layer is aligned; two downstream divergences remain in `compute_weight_matrix` normalization and the Python `_compute_observation_weights` stale-cache path; tracked in `TODO.md`). P3 — New `tests/test_rust_backend.py` coverage: `TestTROPRustBackend::test_bootstrap_rejects_negative_index`, `test_bootstrap_rejects_out_of_range_index`, and the corresponding `TestTROPGlobalRustBackend` pair each assert `ValueError` with the expected regex for malformed inputs on both local and global bootstrap functions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Author
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Highest unmitigated severity: Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
igerber
added a commit
that referenced
this pull request
Apr 24, 2026
…+ Python cache-fallthrough Closes the local-method half of silent-failures audit finding #23 (RNG half closed in PR #354; grid-search half in PR #348). Two methodology fixes, both isolated to the local-method path — global is unaffected. 1. Rust weight-matrix normalization removed ------------------------------------------ `rust/src/trop.rs::compute_weight_matrix` no longer divides `time_weights` and `unit_weights` by their respective sums before the outer product. The paper's Equation 2/3 (Athey, Imbens, Qu, Viviano 2025) and REGISTRY.md Requirements checklist (`[x] Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)`) both specify raw-exponential weights; Python's `_compute_observation_weights` was already REGISTRY-compliant. Rust's normalization inflated the effective nuclear-norm penalty relative to the data-fit term, changing the regularization trade-off. User-visible effect: Rust local-method ATT values may shift for fits with `lambda_nn < infinity`. For `lambda_nn = infinity` (factor model disabled) outputs are unchanged — uniform weight scaling leaves the minimum-norm WLS argmin invariant. Rust LOOCV-selected lambdas may also shift on that boundary; both backends now converge on the same selection. Affects both local-method Rust call sites (LOOCV at trop.rs:459, bootstrap at trop.rs:1096). 2. Python `_compute_observation_weights` cache-fallthrough removed --------------------------------------------------------------- Removed the `if self._precomputed is not None:` branch that silently substituted `self._precomputed["Y"]` / `["D"]` / `["time_dist_matrix"]` (original-panel cache populated during main fit) for the function-argument `Y, D`. Under bootstrap, `_fit_with_fixed_lambda` computes fresh `Y, D` from the resampled `boot_data` and passes them in; the helper was discarding those and recomputing unit distances from the original panel, so Python's local bootstrap resampled units but reused stale unit-distance weights. Rust's bootstrap was already correct (always consumed `y_boot, d_boot`). Test changes ------------ - `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity:: test_bootstrap_seed_reproducibility_local`: flipped from `xfail(strict=True)` to passing `assert_allclose` at `atol=1e-5` across seeds `[0, 42, 12345]`. Residual ~1e-7 gap is Rust `estimate_model` vs numpy `lstsq` roundoff that accumulates differently across per-replicate bootstrap fits; follow-up TODO row tracks unifying Rust to the `solve_wls_svd` path (same SVD helper the global-method uses since PR #348) for sub-1e-14 parity. - New `test_local_method_main_fit_parity`: parametrized over `(lambda_nn=inf, atol=1e-14)` and `(lambda_nn=0.1, atol=1e-10)`; asserts `atol=1e-14` bit-identity for the main-fit ATT at `lambda_nn=inf` (the regression guard for the normalization fix) and `atol=1e-10` for the finite-`lambda_nn` FISTA path. Verification ------------ Targeted regression sweep — all green: - 9 `TestTROPRustEdgeCaseParity` tests (grid-search + global bootstrap × 3 seeds + local bootstrap × 3 seeds + local main-fit × 2 regimes) - Full `test_rust_backend.py` suite: 92 passed - Full `test_trop.py` suite under Rust backend: 120 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber
added a commit
that referenced
this pull request
Apr 24, 2026
P3 — the PR #354 [Unreleased] Fixed entry (line 21) said local-method bit-identity SE remained blocked by the Rust-normalization and Python cache-fallthrough divergences and was "tracked as a follow-up in TODO.md." With the two TROP-local Fixed entries that this PR adds (lines 22-27) closing exactly those divergences, the PR #354 tail sentence is now internally inconsistent with the surrounding entries. Rewritten to say the RNG half of finding #23 is closed here (bootstrap contract), grid-search half was closed in PR #348, and the local- method methodology half is closed by the two Fixed entries that follow in the same release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Rust and Python TROP backends previously produced different bootstrap SE for the same
seed. On a tiny correlated panel underseed=42, the gap was ~28% of SE because Rust seededrand_xoshiro::Xoshiro256PlusPlusper replicate while Python consumednumpy.random.default_rng(PCG64), so identical seeds mapped to different bytestreams.This PR canonicalizes on numpy. A new
stratified_bootstrap_indiceshelper indiff_diff/bootstrap_utils.pypre-generates per-replicate(control, treated)positional index arrays from a numpyGenerator; both TROP bootstrap Rust functions now acceptcontrol_indices/treated_indicesasi64arrays in place ofseed: u64. Parallelism preserved; sampling law (stratified: controls then treated, with replacement) unchanged.test_bootstrap_seed_reproducibilityflipped fromxfail(strict=True)to passingassert_allclose(atol=rtol=1e-14)and parametrized over[0, 42, 12345].xfail(strict=True)with a new finding: aligning the RNG surfaced two separate local-method backend divergences (weight-matrix normalization incompute_weight_matrixis Rust-only; Python_compute_observation_weightsreadsself._precomputed["Y"]instead of the bootstrap-sampleY). Both tracked as follow-up rows inTODO.mdand will land in a separate methodology PR.Methodology references
numpy.random.default_rng(PCG64) is treated as canonical per the repo convention that numpy/scipy paths are the reference implementation for cross-backend parity.Validation
TestStratifiedBootstrapIndices(7 tests: shape/dtype, value range, determinism, prefix invariance, hard-coded value pin fordefault_rng(42), empty-pool edges) intests/test_bootstrap_utils.py.test_bootstrap_seed_reproducibility_localparametrized over three seeds intests/test_rust_backend.py.test_bootstrap_seed_reproducibilityflipped from xfail to parametrized passing regression guard (atol=rtol=1e-14, seeds[0, 42, 12345]). Four existing direct-call tests (test_bootstrap_variance_shape,test_bootstrap_reproducibility, and their global counterparts) updated to construct stratified index arrays via the new helper instead of passingseed.tests/test_rust_backend.pypasses (83 passed + 3 xfailed — the local-method seed-parity tests documenting the new finding).tests/test_trop.py -k bootstrappasses (13 tests) under both Rust and pure-Python (DIFF_DIFF_BACKEND=python) backends.Security / privacy
🤖 Generated with Claude Code